Section: New Results

Combining clustering of variables and feature selection using random forests: the CoV/VSURF procedure

The following result was obtained by M. Chavent and J. Saracco in collaboration with R. Genuer.

High-dimensional data classification is a challenging problem. A standard approach to this problem is variable selection, e.g. using stepwise procedures or LASSO approaches. Another standard approach is dimension reduction, e.g. by Principal Component Analysis (PCA) or Partial Least Squares (PLS) procedures. The approach proposed in this paper combines dimension reduction and variable selection. First, a clustering of variables (CoV) procedure is used to build groups of correlated variables in order to reduce the redundancy of information. This dimension reduction step relies on the R package ClustOfVar, which can deal with both numerical and categorical variables. Second, the most relevant synthetic variables (numerical variables summarizing the groups obtained in the first step) are selected with a procedure of variable selection using random forests (VSURF), implemented in the R package VSURF. The numerical performance of the proposed methodology, called CoV/VSURF, is compared with direct applications of VSURF or random forests (RF) on the original p variables. Improvements obtained with the CoV/VSURF procedure are illustrated on two simulated mixed datasets (cases n>p and n<<p) and on a real proteomic dataset.
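
To make the first step more concrete, the following Python sketch groups correlated numerical variables and summarizes each group by one synthetic variable. It is only an illustration under simplifying assumptions and not the ClustOfVar algorithm: the greedy threshold rule, the sign-aligned average used as the synthetic variable, the `threshold` value, and all function names and toy data are hypothetical choices made for this example. The second step (selection with VSURF) is not reproduced here.

```python
import math
import random


def standardize(col):
    """Center a column and scale it to unit (population) standard deviation."""
    n = len(col)
    mean = sum(col) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
    return [(v - mean) / sd for v in col]


def corr(a, b):
    """Pearson correlation of two already standardized columns."""
    return sum(x * y for x, y in zip(a, b)) / len(a)


def cluster_variables(columns, threshold=0.7):
    """Greedy grouping (an assumption of this sketch, not ClustOfVar):
    a variable joins the first group whose first member it is strongly
    correlated with; otherwise it starts a new group."""
    groups = []
    for j, col in enumerate(columns):
        for group in groups:
            if abs(corr(columns[group[0]], col)) >= threshold:
                group.append(j)
                break
        else:
            groups.append([j])
    return groups


def synthetic_variable(columns, group):
    """Summarize a group by the sign-aligned average of its members."""
    seed = columns[group[0]]
    total = [0.0] * len(seed)
    for j in group:
        sign = 1.0 if corr(seed, columns[j]) >= 0 else -1.0
        for i, v in enumerate(columns[j]):
            total[i] += sign * v
    return [t / len(group) for t in total]


# Toy data (hypothetical): two latent signals observed with noise,
# plus one unrelated variable.
random.seed(0)
n = 200
z1 = [random.gauss(0, 1) for _ in range(n)]
z2 = [random.gauss(0, 1) for _ in range(n)]


def noisy(z):
    return [v + random.gauss(0, 0.3) for v in z]


raw = [noisy(z1), noisy(z1), noisy(z1), noisy(z2), noisy(z2),
       [random.gauss(0, 1) for _ in range(n)]]
columns = [standardize(c) for c in raw]

groups = cluster_variables(columns)
synthetic = [synthetic_variable(columns, g) for g in groups]
print(groups)
```

In the procedure described above, the synthetic variables would then be passed to a random-forest-based selection step in place of the original p variables, so that redundant variables within a group no longer compete with one another.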